The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens
transcriptome.
load("./data_09_04_2024.RData")
#Data:
# myE_ge : raw gene expression count matrix
# info : sample annotation (data-frame)
# ttg : rows (genes) annotation
# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge <- myE_ge[,sel_samples]
info <- info[sel_samples,]
info$group <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))
#Make some nice colors to facilitate the visualisation of time-points
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))How many samples is there in your count data?
Does this correspond to your experimental protocol?
Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.
What are the covariates?
how many rows?
Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.
#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))## [1] 0
How does your data look-like?
par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:
Letās have a close look at how the variance in gene expression scale with average and then correct it with some transformation.
#calculate mean and variance of the rows
row_avg<-apply(myE_ge,1,mean)
row_var <-apply(myE_ge,1,var)
#Log-transform your count data
myE_gel <- log2(1+myE_ge)
#DESeq2 variance stabilisation
vsd <- DESeq2::varianceStabilizingTransformation(matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE))
par(mfrow=c(1,3))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()
plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()
plot(apply(vsd,1,mean),apply(vsd,1,var),las=1,main="DESeq2 variance stabilised",pch=19,col=rgb(0,0,0,0.05),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()Letās first have a look at the distribution of gene expression for a few samples:
par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.
We can now select for the reliably expressed genes in each sample:
Letās first look at the ditribution of the gene count across a few
samples:
There are several options for normalisation. Scaling factors,
quantile normalisation etc.
What is the effect on the gene count?
Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.
Letās first compare the clustering of the samples using the Manhattan
distance and Ward D algorithm:
Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Letās compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:
There are many useful R packages to perform differential gene expression analysis of bulk RNA-sequencing data usge as Sleuth, edgeR and DESeq2.
#Volcano plot
VolcanoPlot <- function(myDGE=res, col_log2FC=res$log2FoldChange, col_pval=res$padj, timepoint=3){
mycol_genes <- rep(rgb(0.7,0.7,0.7,0.2))
mycol_genes[abs(col_log2FC)>=1.0&col_pval<=0.05]<- rep(rgb(0.0,0.0,0.0,0.2))
plot(col_log2FC,-log10(col_pval),pch=19,cex=0.3,col=mycol_genes,xlab="log2FC",ylab="-log10(P-value)", main=paste("DIV_", timepoint, "vs_0"), frame=FALSE)
abline(h=-log10(0.01),lty=2,col="grey")
abline(v=c(-1,1),lty=2,col="grey")
}
#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE, sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}
#Create a venn diagram from two gene lists
## This is an example - not meant to run
vennDiag <- function(genes_lists){
genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
colnames(genes_comparisons)<-c("cond1","cond2")
vennDiagram(genes_comparisons,main="blasdfas")
}Here is a tutorial how to run DESeq2 on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
#How to fit DEseq2:
dds <- DESeq2::mDESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
colData = info,
design= ~ DIV + Fraction)
dds <- DESeq2::DESeq(dds)
mycoeff <- resultsNames(dds) # lists the coefficients
#How to extract a result (for a specific comparison: if comparison corresponds to mycoeff index = "DIV_14_vs_0":
res <- results(dds, name="DIV_14_vs_0")
genes_up <- as.character(rownames(res))[res$log2FoldChange>1.0&res$padj<0.01]
genes_do <- as.character(rownames(res))[res$log2FoldChange<(-1.0)&res$padj<0.01]Here is a tutorial how to run EdgeR with Quasi-likelihood F-tests on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design <- model.matrix(~info$group)
y <- estimateDisp(y,design)
#To perform quasi-likelihood F-tests:
fit <- glmQLFit(y,design)
mycoefs <- colnames(fit$design)
#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups.
#To compare Nuclear_3 vs Nuclear_0:
qlf.nuc3 <- glmQLFTest(fit, coef=2)
res<- topTags(qlf.nuc3,n=sum(is_expressed_global))
#To compare Nuclear_7 vs Nuclear_3
qlf.nuc3.nuc7 <- glmQLFTest(fit, contrast=c(0,-1,1,rep(0,9)))
res<- topTags(qlf.nuc3.nuc7,n=sum(is_expressed_global))
res <- topTags(glmQLFTest(fit, coef=3),n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]Here is a tutorial how to run EdgeR with Likelihood ratio tests on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design <- model.matrix(~info$group)
y <- estimateDisp(y,design)
#To perform likelihood ratio tests:
fit <- glmFit(y,design)
mycoefs <- colnames(fit$design)
#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups.
#To compare Nuclear_3 vs Nuclear_0:
lrt.nuc3 <- glmLRT(fit,coef=2)
res<- topTags(lrt.nuc3,n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
In this task, you will perform a differential gene expression using Deseq2.
Since, we have comparison between 0 time point and every other time point, which are 3, 7, 14, 22 and 35, we expect 2 different plots, one for upregulated genes and the other one with downregulated genes, each with 5 bars in the box plot. For obtaining the sets of upregulated and downregulated genes for each time point we used DESeq2 package. When selecting the design formula, we opted for the DIV and Fraction columns. Since the design formula tells the statistical software for which sources of variation to test for, besides taking your factors of interest (which is DIV in this case), it is also important to include any additional covariates that are sources of variation (therefore addition of the Fraction).
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
#How to fit DEseq2:
dds <- DESeq2::DESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
colData = info,
design= ~ Fraction + DIV)
dds <- DESeq2::DESeq(dds)
mycoeff <- resultsNames(dds) # lists the coefficients
#How to extract a result (for a specific comparison: if comparison corresponds to mycoeff index = "DIV_14_vs_0":
#Extracting the result that correspond to index = "DIV_3_vs_0"
res_3 <- results(dds, name="DIV_3_vs_0")
genes_up_3 <- length(as.character(rownames(res_3))[res_3$log2FoldChange>1.0&res_3$padj<0.01])
genes_do_3 <- length(as.character(rownames(res_3))[res_3$log2FoldChange<(-1.0)&res_3$padj<0.01])
#Extracting the result that correspond to index = "DIV_7_vs_0"
res_7 <- results(dds, name="DIV_7_vs_0")
genes_up_7 <- length(as.character(rownames(res_7))[res_7$log2FoldChange>1.0&res_7$padj<0.01])
genes_do_7 <- length(as.character(rownames(res_7))[res_7$log2FoldChange<(-1.0)&res_7$padj<0.01])
#Extracting the result that correspond to index = "DIV_14_vs_0"
res_14 <- results(dds, name="DIV_14_vs_0")
genes_up_14 <- length(as.character(rownames(res_14))[res_14$log2FoldChange>1.0&res_14$padj<0.01])
genes_do_14 <- length(as.character(rownames(res_14))[res_14$log2FoldChange<(-1.0)&res_14$padj<0.01])
#Extracting the result that correspond to index = "DIV_22_vs_0"
res_22 <- results(dds, name="DIV_22_vs_0")
genes_up_22 <- length(as.character(rownames(res_22))[res_22$log2FoldChange>1.0&res_22$padj<0.01])
genes_do_22 <- length(as.character(rownames(res_22))[res_22$log2FoldChange<(-1.0)&res_22$padj<0.01])
#Extracting the result that correspond to index = "DIV_35_vs_0"
res_35 <- results(dds, name="DIV_35_vs_0")
genes_up_35 <- length(as.character(rownames(res_35))[res_35$log2FoldChange>1.0&res_35$padj<0.01])
genes_do_35 <- length(as.character(rownames(res_35))[res_35$log2FoldChange<(-1.0)&res_35$padj<0.01])
upregulated <- c(genes_up_3, genes_up_7, genes_up_14, genes_up_22, genes_up_35)
downregulated <- c(genes_do_3, genes_do_7, genes_do_14, genes_do_22, genes_do_35)
identification <- c("3_vs_0", "7_vs_0", "14_vs_0", "22_vs_0", "35_vs_0")
As we can observe from the plots above, numbers of both upregulated and
downregulated genes increase over time. These changes in numbers are
expected since we are comparing the numbers with the respect to the 0
time point, which corresponds to the undifferentiated state of the cell
(stem cell state). While it progresses towards the differentiated cell
state, many genes gain activity with the transition to the new cell
purpose, but also many lose their activity, thereby such a plot. It is
interesting to note that the biggest difference in numbers for both
upregulated and downregulated genes is achieved between time point 14
and time point 22.
With volcano plot besides differentially expressed genes, we can also
observe their statistical significance (p-value) versus magnitude of
change (fold change). The most upregulated genes are towards the right,
the most downregulated genes are towards the left, and most
statistically significant genes are towards the top. It is important to
note that \(-log_{10}(P_{value})\)
increase with the time, meaning that genes in the last plot (time point
35 compared to the time point 0) have higher values compared to the
other.
The plot we obtained displays enrichment of gene ontology (GO) terms. The x-axis represents functional terms that are grouped and color-coded. From GO library: molecular function (MF), biological process (BP), cellular components (CC) and KEGG pathways. The y-axis shows the adjusted enrichment p values in ālog10 (Padj). For upregulated genes, top GO terms associated with MF, BP and KEGG are:
If we take a closer look into the top GO terms associated with the upregulated genes, we can come to the conclusion that the vast majority of them is associated with the system development which is expected since the cells on which we are performing RNA-seq and differential gene expression analysis are differentiating from human induced pluripotent stem cells towards the electrophysiologicaly active motor neurons. Also, we observe upregulation of pathway associated with calcium ion binding which is important for functioning of motor neurons, as well as wnt-protein binding which is secreted growth factor involved in signaling.
When it comes to downregulated genes, top GO terms associated with MF, BP and KEGG are:
Do the same as task 1 but using EdgeR ā Likelihood ratio tests. Comment on the differences between the techniques.
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# I'm not sure about this, but I think we should change group to DIV.
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$DIV)
design <- model.matrix(~info$Fraction + info$DIV)
y <- estimateDisp(y,design)
#To perform likelihood ratio tests:
fit <- glmFit(y,design)
mycoefs <- colnames(fit$design)
#We have 6 coefficients. Baseline level of DIV_0 is coeff 1. The remaining are the difference between DIV_0 and the other groups.
#To compare DIV_3 vs DIV_0:
lrt.div3 <- glmLRT(fit,coef=3)
res_3R <- topTags(lrt.div3,n=sum(is_expressed_global))$table
genes_up_3R <- length(as.character(rownames(res_3R))[res_3R$logFC>1.0&res_3R$FDR<0.01])
genes_do_3R <- length(as.character(rownames(res_3R))[res_3R$logFC<(-1.0)&res_3R$FDR<0.01])
#To compare DIV_7 vs DIV_0:
lrt.div7 <- glmLRT(fit,coef=4)
res_7R <- topTags(lrt.div7,n=sum(is_expressed_global))$table
genes_up_7R <- length(as.character(rownames(res_7R))[res_7R$logFC>1.0&res_7R$FDR<0.01])
genes_do_7R <- length(as.character(rownames(res_7R))[res_7R$logFC<(-1.0)&res_7R$FDR<0.01])
#To compare DIV_14 vs DIV_0:
lrt.div14 <- glmLRT(fit,coef=5)
res_14R <- topTags(lrt.div14,n=sum(is_expressed_global))$table
genes_up_14R <- length(as.character(rownames(res_14R))[res_14R$logFC>1.0&res_14R$FDR<0.01])
genes_do_14R <- length(as.character(rownames(res_14R))[res_14R$logFC<(-1.0)&res_14R$FDR<0.01])
#To compare DIV_22 vs DIV_0:
lrt.div22 <- glmLRT(fit,coef=6)
res_22R <- topTags(lrt.div22,n=sum(is_expressed_global))$table
genes_up_22R <- length(as.character(rownames(res_22R))[res_22R$logFC>1.0&res_22R$FDR<0.01])
genes_do_22R <- length(as.character(rownames(res_22R))[res_22R$logFC<(-1.0)&res_22R$FDR<0.01])
#To compare DIV_35 vs DIV_0:
lrt.div35 <- glmLRT(fit,coef=7)
res_35R <- topTags(lrt.div35,n=sum(is_expressed_global))$table
genes_up_35R <- length(as.character(rownames(res_35R))[res_35R$logFC>1.0&res_35R$FDR<0.01])
genes_do_35R <- length(as.character(rownames(res_35R))[res_35R$logFC<(-1.0)&res_35R$FDR<0.01])
upregulatedR <- c(genes_up_3R, genes_up_7R, genes_up_14R, genes_up_22R, genes_up_35R)
downregulatedR <- c(genes_do_3R, genes_do_7R, genes_do_14R, genes_do_22R, genes_do_35R)
identificationR <- c("3_vs_0", "7_vs_0", "14_vs_0", "22_vs_0", "35_vs_0")With the edgeR tehnique, we can observe similar results to the ones achieved with the DESeq2 method. There is still the largest difference in numbers of upregulated and downregulated genes between time point 22 and time point 14. On the other hand, we can also notice slight decrease in numbers of upregulated genes in edgeR tehnique, as well as slight increase in numbers of downregulated genes.
Regarding the Volcano plot, we again observe similar results as in the
previous case, with slight difference being that we are plotting \(-log_{10}FDR\) instead of adjusted
p-value.
genes_up_3R_names = as.character(rownames(res_3R))[res_3R$logFC>1.0&res_3R$FDR<0.01]
genes_do_3R_names = as.character(rownames(res_3R))[res_3R$logFC<(-1.0)&res_3R$FDR<0.01]
par(mfrow=c(1,2))
# Upregulated genes
GO_analysis(genes_up_3R_names)GO enrichment analysis results in the same pathways being enriched with both edgeR and DESeq2 methods.